On Bit-Parallel Processing of Multi-byte Text

نویسندگان

  • Heikki Hyyrö
  • Jun Takaba
  • Ayumi Shinohara
  • Masayuki Takeda
چکیده

There exist practical bit-parallel algorithms for several types of pair-wise string processing, such as longest common subsequence computation or approximate string matching. The bit-parallel algorithms typically use a size-σ table of match bit-vectors, where the bits in the vector for a character λ identify the positions where the character λ occurs in one of the processed strings, and σ is the alphabet size. The time or space cost of computing the match table is not prohibitive with reasonably small alphabets such as ASCII text. However, for example in the case of general Unicode text the possible numerical code range of the characters is roughly one million. This makes using a simple table impractical. In this paper we evaluate three different schemes for overcoming this problem. First we propose to replace the character code table by a character code automaton. Then we compare this method with two other schemes: using a hash table, and the binary-search based solution proposed by Wu, Manber and Myers [25]. We find that the best choice is to use either the automaton-based method or a hash table.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts

Techniques in processing text files “as is” are presented, in which given text files are processed without modification. The compressed pattern matching problem, first defined by Amir and Benson (1992), is a good example of the “as-is” principle. Another example is string matching over multi-byte character texts, which is a significant problem common to oriental languages such as Japanese, Kore...

متن کامل

Secure and Fast Implementations of Two Involution Ciphers

Anubis and Khazad are closely related involution block ciphers. Building on two recent AES software results, this work presents a number of constant-time software implementations of Anubis and Khazad for processors with a byte-vector shuffle instruction, such as those that support SSSE3. For Anubis, the first is serial in the sense that it employs only one cipher instance and is compatible with...

متن کامل

Study of Japanese Text Compression

The Japanese language has several thousand distinct characters, and the character code length is 16 bits. In such documents the X-bit units are interrelated. Conventional text compression employs g-bit sampling because the compressed object is usually English text. We investigated compression schemes based on ldbit sampling, expecting it improve the compression performance. In Japanese text whe...

متن کامل

A bi-objective model for a scheduling problem of unrelated parallel batch processing machines with fuzzy parameters by two fuzzy multi-objective meta-heuristics

This paper considers a bi-objective model for a scheduling problem of unrelated parallel batch processing machines to minimize the makespan and maximum tardiness, simultaneously. Each job has a specific size and the data corresponding to its ready time, due date and processing time-dependent machine are uncertain and determined by trapezoidal fuzzy numbers. Each machine has a specific capacity,...

متن کامل

Fast parallel CRC algorithm and implementation on a configurable processor

-In this paper we present a fast cyclic redundancy check (CRC) algorithm that performs CRC computation for any length of message in parallel. For a given message with any length, we first chunk the message into blocks, each of which has a fixed size equal to the degree of the generator polynomial. Then we perform CRC computation among the chunked blocks in parallel using Galois Field multiplica...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004